Comparison standards shape everyday judgments of low and high wellbeing in individuals with and without psychopathology: a diary-based investigation

People can easily rate and express their current levels of wellbeing, but the cognitive foundations for such judgments are poorly understood. We examined whether comparisons to varying standards underlie fluctuating wellbeing judgments within-person (i.e., throughout daily episodes) and between-person (i.e., high vs. low levels of psychopathology). Clinical and non-clinical participants recorded subjective affect for each distinct episode for one week. Participants briefly described current, best, and worst daily episodes, which we coded for presence and type of comparison standard (social, past temporal, criteria-based, counterfactual, prospective temporal, and dimensional). Participants also rated their engagement with these standards and the respective affective impact. During best episodes, participants reported more downward (vs. upward) comparisons that resulted in positive affective impact. In worst episodes, upward (vs. downward) comparisons were more frequent. In best and worst episodes, we most frequently identified past-temporal and criteria-based comparisons, respectively. The clinical group engaged more often with all potential standard types during worst daily episodes and was more negatively affected by comparative thoughts, amid consistently more negative affect levels across all episode types. Our data suggest that judgments of affect and wellbeing may indeed rely on comparative thinking, whereby certain standards may characterize states of negative affect and poor mental health.

www.nature.com/scientificreports/With respect to within-person differences, we expected that episodes with relatively negative affect (i.e., episodes judged to be the worst daily episodes) would be characterized generally by more upward than downward comparisons, relative to episodes with relatively positive affect (i.e., episodes judged to be the best daily episodes).We further expected the clinical group to display more upward comparisons across episodes, relative to the non-clinical group.Moreover, we expected the clinical group to display relatively larger negative impact of comparisons on mood, i.e.: a more negative impact of comparisons during the worst episodes, and a less positive impact during best episodes.Due to a scarcity of research, we did not initially formulate firm predictions concerning group differences per type of comparison or type of episode.Based on prior studies with clinical populations 20,37,41 , we tentatively expected the clinical group to display more upward social, criteria-based, pasttemporal, and counterfactual comparisons, as well as a stronger negative influence on affect in clinically ill populations.Also on an exploratory basis, we investigated whether the association between self-reported comparison engagement and affective impact is more negative in the clinical than in the non-clinical group, extending our general hypothesis that comparisons are more detrimental for the clinical group.Finally, we explored whether the clinical group would engage in comparison with less concrete comparison standards.

Method Participants
The current study was completed by 25 individuals seeking psychotherapy in the outpatient clinic for psychotherapy at the University of Münster, as well as 25 healthy individuals without clinical symptomatology matched on age, sex, and level of education.All participants enrolled after having completed a separate study, in which we used a semi-structured interview to assess comparative thinking 41 .For three additional participants from the interview study who completed this study, diary data were lost (technical failure: n = 1; filled out less than 15% of measurements: n = 2).Hence, these participants and their non-clinical counterparts were not included in any of the analyses.Inclusion criteria for the clinical group were proficiency in German at an intermediate level or higher and diagnosis of one or more mental disorders established with the Structured Clinical Interview for DSM-IV (SCID-I) 43 .Exclusion criteria were acute suicide risk, self-injury, substance addiction, a body mass index (BMI) < 17.5, acute psychotic symptoms (to exclude altered states of consciousness or severe formal thought disorders that may complicate open answer coding, e.g., due to loose or derailed associations, neologisms, stilted language, etc.), and/or current psychiatric or psychological treatment.For the nonclinical group, diagnosis or treatment of a mental disorder served as an additional exclusion criterion, established by means of a self-report screening, corroborated by below-cutoff depression and anxiety scores (see below; for details of the recruitment process, see Morina et al. 41 ).Characteristics of the final sample are summarized in Table 1.Main diagnoses in the clinical group were mood (n = 8) and anxiety disorders (n = 8), adjustment disorder (n = 3), obsessive-compulsive disorder (n = 2), and individual cases of eating disorder, body dysmorphic disorder, post-traumatic stress disorder, and schizophrenia without acute psychotic symptoms.Participants received 8€/h in compensation for their time.All participants gave written informed consent.This study was approved by the ethical committee of University of Münster and was performed in accordance with ethical guidelines specified in the APA Code of Conduct as well as research ethics guidelines in Germany.

Symptom levels
Depressive symptoms were assessed using the Patient Health Questionnaire (PHQ-9) 44 , consisting of nine items to assess the symptom severity of depression over the last 2 weeks.Answers are scored on 4-point scales (0 = not   www.nature.com/scientificreports/ at all; 3 = nearly every day; α = 0.89).Generalized anxiety symptoms were assessed using the Generalized Anxiety Disorder 7 (GAD-7) 45 , comprising seven items about symptoms of generalized anxiety over the last 2 weeks.The GAD-7 is scored on 4-point scales (0 = not at all, 3 = nearly every day; α = 0.88).

Diurnal affect diary
Diurnal affect was assessed for a period of seven consecutive days using the Day Reconstruction Method (DRM) 42,46 .For this purpose, participants filled out a pen-and-paper diary, as well as a complemented web survey every evening.The pen-and-paper diary required participants to record all distinct episodes of the respective day (up to 30 per day if needed), to indicate start and end times, and to write down a brief description.They then indicated for each episode how they felt on 11-point scales (0 = very bad, 10 = very good).Afterwards, participants filled out the web survey, where they reported how they have felt since last filling out the diary, how they felt currently, during the best episode of the day, and during the worst episode of the day, on 11-point scales (0 = very bad, 10 = very good).For each of these three judgments, they were prompted to substantiate their rating using text boxes.In addition, for the best and worst episodes, they answered a series of items about cognitive engagement with potential comparison standards during the episode, i.e., whether they engaged in social (In this situation, did you think about other people?), past temporal (…your past?), prospective temporal (…your future?), criteriabased (…how things should be?), counterfactual (…how you would feel if certain things had happened or had not happened?),or dimensional cognitions (…how you are better in some things than in others?), on 11-point scales (0 = not at all, 10 = all the time).For each type of cognition, they also indicated whether it influenced their mood (− 5 = much worse, + 5 = much better).
To quantify diurnal affect characteristics, we calculated mean affect scores for current, best, and worst affect across the seven days.In addition, we calculated mean affect across all episodes reported in the paper-and-pen diary.We also calculated individual standard deviations and mean squared successive differences (MSSD) as indices of affect variability and temporal instability, respectively 46,47 .

Coding
Text responses substantiating the current, best, and worst affect ratings were entered into a coding scheme to classify whether the affect rating was justified by a reference to (1) an external event (yes/no), (2) an internal process like thoughts or feelings (yes/no), and (3) to the respondent's own activity (yes/no).Next, raters coded (4) the presence of a comparison (present/potentially present/absent); to be coded 'present' , the response had to mention (a) some comparison standard and (b) relate the standard to the self on some dimension 38 .If one condition (a or b) was unclear (i.e., not explicitly included in the response), comparison was coded as 'potentially present' .If one condition was not met, comparison was coded 'absent' .Present and potentially present comparisons were then coded for (5) direction (upward, downward, lateral), (6) comparison type (social, past temporal, criteria-based, counterfactual, prospective temporal, dimensional, contextual), and (7) specificity of the standard (0 = generic; 1 = intermediate; and 2 = specific).
Following consensus agreement between three raters on four transcripts (NM, TM, VM), two raters (VM, TM) coded aspects 4-7 independently of each other for all coding schemes.For aspects 1-3, 30% of the schemes were coded by both raters and the rest by only one rater (TM).Across episodes, mean absolute agreement between raters for aspects 1-3 was 0.871, 0.922, and 0.887, respectively.For aspects 4-6, absolute agreement was 0.801, 0.946, and 0.701, respectively, and 0.614 for the specificity ratings (7).Discrepancies were resolved by a third rater (NM).

Procedure
Candidates were screened for inclusion and exclusion criteria by telephone (clinical group) or by an online screening (non-clinical group).Eligible participants were then invited to a laboratory session, where they completed the GAD-7 and PHQ-9, a structured clinical interview (clinical group only), and a semi-structured interview about their subjective wellbeing relative to various comparison standards 41 .Afterwards, they received instructions regarding the pen-and-paper diary as well as the daily web surveys, which they then completed every evening for 7 consecutive days.Participants received the links to the daily web surveys via email, each time accompanied by a reminder SMS.Data collection took place in 2018-2019.

Statistical analysis
The main analyses in the present study focused on engagement with comparison standards (as identified by raters and according to self-report) and the engendered affective impact.Since rater-identified comparisons occurred very infrequently, it was deemed impractical to account for the nested data structure (occurrences within participants and within days) in our statistical approach.Therefore, we summed the number of comparisons per type and direction across all seven days, separately for current, best, and worst episodes, and separately for "present" and "potentially present" comparisons.These frequencies were then subjected to mixed ANOVAs with type of episode (current, best, worst) as within-subjects factor and group (clinical, non-clinical) as betweensubjects factor, in addition to either direction (upward, downward), or type of comparison (social vs. past temporal vs. criteria-based vs. counterfactual vs. prospective temporal vs. dimensional), as within-subject factor.
Next, self-reported engagement and affective impact of potential comparisons were analyzed using a linear mixed models (LMM) approach (model type: III sum of squares; test method: Satterthwaite), accounting for the nested data structure, with measurements nested within participants and within days.Main and interaction effects involving group, type of episode (best, worst), and type of standard, were entered as fixed factors, while intercepts of participants and of days were entered as random effects components.Contrasts were explored using Holm corrections.Two separate models addressed engagement with potential comparison standards and www.nature.com/scientificreports/affective impact, respectively.Finally, we used a model in which self-reported engagement was entered as a continuous fixed effects variable to predict self-reported affective impact.Interactions in LMM were followed up by contrasting estimated marginal means using Holm correction.When sphericity assumptions or variance homogeneity for ANOVA or t-tests were violated, Greenhouse-Geisser adjusted p values are reported along with the respective epsilon and uncorrected degrees of freedom.We report Cohen's d (in LMM, calculated from contrast t and df) and η 2 p as effect size estimates.The sample included completers of a prior interview study 41 and sample size for the present study was not based on a specific a-priori power analysis.At an alpha level of 0.05, our sample size was adequate for the detection of at least small-to-medium within-between interactions (e.g., f ≥ 0.205 for an Episode type × Group interaction) or large differences between two independent samples (d ≥ 0.81) with a power (1 − β) = 0.80 (as determined using G*Power V3.1.9).Group differences in sample characteristics were analyzed using IBM SPSS version 28.LMM analyses were conducted JASP 48 .The analyses were not preregistered.

Sample characteristics and diurnal affect
Sample characteristics, including mean levels of anxiety and depression, are depicted in Table 1.Number and duration of all recorded DRM episodes (i.e., all episodes that participants could distinguish during the 7-day period), as well as descriptive statistics on the reported valence across episodes, are summarized in Table 2.The reported valence was consistently higher in the non-clinical than in the clinical group for all episode types.This was further confirmed by a 2 (Episode type: best, worst) × 2 (Group: clinical, non-clinical) mixed ANOVA showing a main effect for Episode type, F(1,48) = 627.174,p < 0.001, η 2 p = 0.929, a main effect for group, F(1,48) = 9.059, p = 0.004, η 2 p = 0.159, in the absence of an interaction (p = 0.750, η 2 p = 0.002).

Rater-coded presence of comparisons
Overall frequencies Recall that we used narrative descriptions of the current, best and worst daily episode to code engagement in comparative behavior during these episodes.Table 3 displays the overall mean comparison numbers across all 21 distinct episodes identified by the raters, using either a strict coding algorithm (i.e.requiring explicit mention of a standard and of how the standard relates to oneself) or a more liberal algorithm (i.e., standard and self-relation were implied).As can be seen, only a few comparisons were identified with strict coding (overall M = 1.58,SD = 2.02), whereas there were more potential comparisons according to liberal coding (overall M = 3.90, SD = 3.33).Upward and downward comparisons were more frequently used than lateral comparisons.
In terms of standard type, comparisons to past temporal and criteria-based standards were most frequently used.Lateral and all other types of comparison were found too infrequently to consider in the following analyses.There were no group differences in comparison numbers, irrespective of comparison type, both when coded strictly (all ps > 0.36) and liberally (all ps > 0.12).

Self-reported engagement with potential comparison standards
The full models of all LMM analyses are provided via the Open Science Framework and can be inspected using the following link: https:// osf.io/ uwhbp/.For self-reported engagement frequencies, the LMM analysis revealed main effects of Episode type (i.e., more engagement during worst compared to best episodes; d = 0.43), F(1,4086.1)= 188.75,p < 0.001, and Standard type (social, past temporal, criteria-based, counterfactual,  1.In addition, there was a significant Episode type × Standard type interaction, F(5,4086.1)= 28.67,p < 0.001.There was no three-way interaction (p = 0.118).Contrasts confirmed the Episode type × Group interaction (p Holm < 0.001, d = 0.11), which was due to a relatively larger effect of Episode in the clinical group, but we identified no significant simple effects of Group (p Holm ≥ 0.247, d < 0.43).Regarding the Standard type × Group interaction, there were also no significant simple effects for Group per Standard type after correcting for multiple comparisons, whereby the largest descriptive group concerned engagement with criteria-based standards, (p Holm = 0.137, d = 0.55; uncorrected p = 0.023).An additional contrast exploring the Standard type × Group interaction indicated that the interaction can be partly attributed to differences between criteria-based and counterfactual standards.That is, the clinical group engaged relatively more in criteria-based than in counterfactual comparisons, which difference was smaller in the nonclinical group (p Holm < 0.001, d = 0.13).
Finally, we explored the Episode type × Standard type interaction with contrasts (across groups), revealing that engagement was generally larger during worst episodes for all standards, p Holm < 0.001, d > 0.10, with the exception of the social standard (p Holm = 0.159, d = 0.04).The largest simple effect of Episode type was present for the criteria-based standard, (p Holm < 0.001, d = 0.46).
Contrasts addressing the Episode type × Group interaction (p Holm < 0.001, d = 0.17) revealed that during worst episodes, the clinical group reported a more negative affective impact than the non-clinical group, p Holm = 0.036, d = − 0.64.No such difference was found in best episodes, p Holm = 0.366, d = 0.24.With respect to the Group × Standard type interaction, contrasts did not reveal any simple effect per Standard type, p Holm > 0.272.Rather, the interaction appeared to stem from a difference in affective impact between criteria-based and counterfactual standards, which was larger in the clinical than in the non-clinical group, p Holm < 0.001, d = 0.13.That is, the clinical group had a relatively more positive impact in response to counterfactual standards, or a relatively more negative impact to criteria-based standards, as compared with the non-clinical group.
The Episode type × Standard type interaction (across groups) indicated that affective impact was generally lower during worst episodes than during best episodes (p Holm < 0.001, d > 0.28), whereby the difference was the largest for the social standard (d = 0.75).Within best episodes, the social standard also had a more positive affective impact compared to all other standards (p Holm < 0.001, d = 0.34).Within worst episodes, criteria-based standards had a more negative affective impact compared to all other standards (p Holm < 0.001, d = − 0.28).

Conditional associations between engagement with standards and affective impact
The LMM analysis of self-reported affective impact with engagement as an additional continuous fixed effects variable revealed a Group × Episode type × Engagement interaction, F(1,4086.1)= 1421.76,p < 0.001.Furthermore, there was an Episode Type × Standard type × Engagement interaction, F(5,4068.8)= 13.95,p < 0.001.Contrasts

Discussion
We investigated whether judgments of subjective wellbeing are informed by different types of comparison, with standards varying within-person (i.e., best vs. worst daily episodes), and between clinical vs. non-clinical participants.Overall, best daily episodes were characterized by positive affective impact across all potential comparison standards (social, past temporal, criteria-based, counterfactual, prospective temporal, dimensional), and we most frequently identified downward and past-temporal comparisons.In contrast, worst episodes were characterized by negative affective impact across all potential standards, whereby we most often identified upward and criteria-based comparisons.Compared to non-clinical participants, the clinical group had consistently lower affect levels, engaged more often with criteria-based relative to other standards (e.g., counterfactual), and generally engaged more strongly with any type of comparison standard during worst compared to best episodes.Critically, engagement with potential comparison standards during worst episodes was also more strongly linked to negative affective impact in the clinical group.These findings suggest that comparative thinking underlies judgments of affect and wellbeing, whereby certain standards (e.g., criteria-based) may be particularly involved in negative affect and poor mental health.
Our data supported the expectation that episodes of low wellbeing (i.e., worst daily episodes) are characterized generally by more upward than downward comparisons, relative to episodes with high wellbeing (i.e., best daily episodes).This aligns with the typical finding in cross-sectional, experimental and diary studies that upward comparison results in worse affect while downward comparison is followed by better affect [32][33][34][35][36] , an effect that has been robustly found in social comparison studies 17 .Moreover, factor analytical studies indicate that for the majority of standards (e.g., social, past-temporal, criteria-based), upward comparison is generally motivationally aversive, whereas downward comparison is appetitive 22,26,31 .Thus, the present study replicates and extends these findings by showing that (a) people tend to invoke upward comparison to explain their wellbeing during worst daily episodes, whereas they virtually never invoke downward comparisons, (b) the opposite pattern was found for best daily episodes, (c) in worst and best episodes, people more likely mention certain criteria or their past, respectively, (d) engagement with all comparison standards has negative affective impact during worst and positive impact during best episodes, and (e) that these patterns can be found in both in clinical and non-clinical individuals.
Meanwhile, our analyses involving between-subjects effects may be somewhat surprising as they lend only limited support to the idea that individuals suffering from a mental disorder generally engage more frequently in unfavorable comparisons 20,37 .That is, amid consistently lower levels of wellbeing, we found no differences between groups regarding the number of comparisons they used to substantiate their wellbeing score during the different episode types.Instead, the clinical group reported generally more pronounced engagement with potential comparison standards in worst compared to best episodes, accompanied by more negative impact.Additionally, more subtle group differences indicate that the clinical group engages relatively more often in thoughts about criteria-based standards having a negative affective impact, than in counterfactual comparisons having a more positive impact.Moreover, there were no differences in the concreteness of the comparison standards, or type of explanation provided.This may seem surprising in light of our prior study using a wellbeing interview 41 , in which we found relatively large differences between the clinical and non-clinical groups, such as generally more comparisons in the clinical group (d = 1.46), and more upward comparisons in particular (d = 1.01).One explanation for the current findings is that these group differences do not translate to strong differences moment-to-moment comparative processes between clinical and non-clinical populations.This makes sense insofar as a one-off assessment concerning comparisons in relation to wellbeing invites more global and long-term considerations about one's life.In contrast, the focus of the present study-momentary wellbeing per episode-may invite thoughts about closer frames of reference (e.g., spatially, temporally) that may be relatively similar in clinical and non-clinical groups.Indeed, this aligns with the finding that both groups similarly refer to their own activities, external events, or internal thought processes, when explaining their current wellbeing without any significant differences regarding the frequency of comparisons.Importantly, lower judgments of wellbeing in the clinical group relative to the non-clinical group might also be the result of some individual comparisons exerting a stronger impact in the clinical group than the non-clinical group.Yet, our study was not designed to assess the impact of single comparisons on wellbeing judgments and hence this remains an open question for future research.

Limitations and considerations for future research
It is important to note that prior to completing this diary study, our participants had enrolled in an interview study addressing comparative thinking 41 .This may have led some participants to be more aware of their comparative thinking and behaviours than usual, despite our attempts to minimize demand and expectancy biases by (a) relying on open-ended rather than closed questions and (b) omitting explicit reference to comparisons when asking about engagement with the different standards.However, the finding that our open-ended questions per episode yielded a relatively low number of comparisons, even with liberal coding, suggests negligible demand characteristics.Instead, comparative thinking may often occur without people being aware of it.Hence, a different limitation is that our open-ended questions may have been relatively insensitive to capture comparative thinking.Indeed, a 5-min long interview appears to yield much higher numbers of comparison by allowing more fine-grained differentiation of comparison types, in particular when respondents are directly prompted about different standards 41 .Relatedly, we asked participants about engagement with potential comparison standards ("have you thought about other persons?"), but this may not always correspond to engagement in comparison, which additionally requires relating the standard to the self 49 .A more general potential limitation is that we did not control for various individual and sociodemographic factors that may influence reporting of comparative behaviour (e.g., intelligence, socioeconomic status, important biographical events), which need to be considered in future studies.Finally, our sample size was rather small to identify overall between-group differences, implying that statistical power may have been adequate only for between-within interactions and for larger effect sizes.Thus, absence of effects should not be inferred from our data-in particular with regards to smaller effects that may have gone undetected-and replication with larger samples is clearly warranted.
A noteworthy observation was that participants in the clinical group tended to report fewer daily and longer episodes than the non-clinical group (see Table 2).Although this difference was not significant, such a pattern could be explicable by a combination of lower levels of activity in patients along with possible distortions in memory recollection, such as a difficulty to retrieve specific episodes 50 .Therefore, future studies addressing comparative thinking in everyday life should take the potential roles of activity levels and memory processes into account.

Conclusions
Our study shows that fluctuating judgments of affect and wellbeing are associated with comparative thinking, such that upward and downward comparisons, criteria-based and past-temporal comparisons, and references to internal thoughts and own activities tend to characterize worse and best daily episodes, respectively.These diurnal patterns of changing comparison standards appear to be similar in clinical and non-clinical groups to a large extend, whereby the clinical group displayed a more pronounced engagement with aversive standards.Moreover, certain standards (e.g., criteria-based) may play a particularly important role in negative affect and poor mental health, warranting further careful research.Future studies may extend these findings by investigating more nuanced differences in comparative thinking, e.g., in specific patient subgroups or during certain types of daily episode. https://doi.org/10.1038/s41598-024-54681-x

Figure 1 .
Figure 1.Estimated mean frequencies of engagement per group in the best/worst daily episode (left panel) and with each standard type (right panel).The background data represent estimates per participant.For reasons of visibility, these estimates are aggregated across all days.Error bars denote 95% CI of the estimated marginal means.

Figure 2 .
Figure 2.Estimated mean affective impact per group in the best/worst daily episodes (left panel) and with each standard type (right panel).The background data represent estimates per participant.For reasons of visibility, these estimates are aggregated across all days.Error bars denote 95% CI of the estimated marginal means.

Table 3 .
Frequency of rater-coded comparisons per direction and standard type.Comparison numbers did not differ statistically between clinical and non-clinical group.= 62.26, p < 0.001, both of which were qualified by interactions with Group (clinical, non-clinical); Episode type × Group: F = 13.26,p < 0.001, Standard type × Group F = 6.14, p < 0.001, see Fig.